FEAST: a Fast and Accurate Microbial Source-Tracking Tool (fully working version)
Xu Rui, Guangdong Academy of Sciences (environmental microbiology)
I recently wanted to try source tracking for a project for my PI, and a WeChat public account happened to push a guide on it. After testing, however, I found that both the author's original GitHub version and the "文涛聊科研" version had bugs that kept the scripts from running. So I worked through the whole pipeline myself and cleaned it up; I hope this helps beginners get started quickly.
I. Prior work (with thanks; feel free to compare against these)
1. Original GitHub repository:
https://github.com/cozygene/FEAST (click the green Download button on the right to download the whole repository)
II. Bugs found in testing
1. The original script does not list the packages it depends on, which is unfriendly to beginners.
2. The expected file formats are inconsistent (txt vs. csv). With your own data it is very easy to hit an error at Line 68 of the script, the coverage calculation
(# COVERAGE = min(rowSums(otus[c(train.ix, test.ix),]))),
which is usually caused by the table header being read incorrectly earlier on.
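To make bug 2 concrete: if the first column of sample IDs is not declared as row names when reading the csv, the whole matrix becomes character-typed and rowSums() fails exactly as at Line 68. A minimal sketch (the toy table and file are hypothetical):

```r
# Toy OTU table whose first column holds the sample IDs
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(SampleID = c("S1", "S2"), OTU1 = c(5, 3), OTU2 = c(0, 7)),
          tmp, row.names = FALSE)

# Wrong: the ID column is kept as data, so the matrix is character
bad <- as.matrix(read.csv(tmp, header = TRUE))
is.numeric(bad)            # FALSE -> rowSums(bad) would fail

# Right: declare column 1 as the row names, keeping the matrix numeric
good <- as.matrix(read.csv(tmp, header = TRUE, row.names = 1))
rowSums(good)
```

Declaring row.names = 1, as the revised script does, is what keeps the count matrix numeric.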
3. In the original test data, one element of the results has the wrong length (the others have 4 entries, one has only 3), so the later as.data.frame() call on the results fails.
Figure: the bug in the original test data
Figure: the as.data.frame() error when converting the original results
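Bug 3 can be reproduced without the original data: a list of per-sink proportion vectors with unequal lengths cannot be converted by as.data.frame(). A minimal sketch (toy numbers and source names are made up); the NA-padding fix shown here is the same idea the revised script uses on Proportions_est:

```r
# One sink returned one fewer proportion than the other
props <- list(repeat_1 = c(soil = 0.5, water = 0.3, unknown = 0.2),
              repeat_2 = c(soil = 0.6, unknown = 0.4))

# as.data.frame() fails: the elements imply differing numbers of rows
res <- try(as.data.frame(props), silent = TRUE)
inherits(res, "try-error")   # TRUE

# Pad the short vector(s) with NA before converting
n <- max(lengths(props))
props <- lapply(props, function(x) { length(x) <- n; x })
as.data.frame(props)         # now works; the missing entry is NA
```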
4. The 文涛 version provides a nice results-visualization script, but unfortunately ships no test data.
III. How to use the revised scripts
1. Copy the revised script folder (which also contains the original version for reference) into your working directory.
2. Potential origins of the community are labeled "Source"; the target community to be decomposed is labeled "Sink".
3. FEAST ships two scripts: use "FEAST_example_one_sink.R" for a single sink and "FEAST_example_Multiple_sinks.R" for multiple sinks. Both depend on the main program "FEAST_src/src.R".
4. The multiple-sink script additionally requires the parameter "different_sources_flag" (= 1 when each sink has its own sources, = 0 when all sinks share the same sources); see the decision flow charts below:
Figure: decision flow for a single sink
Figure: decision flow for multiple sinks
5. The revised scripts consistently use the csv format.
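To make the flag choice in step 4 concrete, here are two toy meta_data layouts (the column names Env, SourceSink, and id follow the script; the sample and environment names are made up):

```r
# different_sources_flag = 0: all sinks share one source pool,
# so the id column only numbers the sinks (the script sets source ids to NA)
meta_shared <- data.frame(
  row.names  = c("soil1", "water1", "sinkA", "sinkB"),
  Env        = c("soil", "water", "gutA", "gutB"),
  SourceSink = c("Source", "Source", "Sink", "Sink"),
  id         = c(NA, NA, 1, 2))

# different_sources_flag = 1: each sink has its own sources, paired by id
meta_paired <- data.frame(
  row.names  = c("soil1", "sinkA", "water1", "sinkB"),
  Env        = c("soil", "gutA", "water", "gutB"),
  SourceSink = c("Source", "Sink", "Source", "Sink"),
  id         = c(1, 1, 2, 2))
```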
IV. Script annotation and walkthrough
1. The few important arguments are all set at the top of the script.
2. Required files: meta_data (the grouping file; see the flow charts above for its format) and otu_data (the sample OTU abundance table). Just set the matching file names in the script.
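One detail worth spelling out: the script transposes the OTU table right after reading it, so in all_otu.csv the rows are OTUs and the columns are samples, whose names must match the row names of meta_data. A toy sketch (sample and OTU names are hypothetical):

```r
# Rows = OTUs, columns = samples
f <- file.path(tempdir(), "all_otu.csv")   # stands in for Data_files/all_otu.csv
otu <- data.frame(soil1  = c(10, 0, 5),
                  water1 = c(2, 8, 0),
                  sinkA  = c(6, 3, 1),
                  row.names = c("OTU_1", "OTU_2", "OTU_3"))
write.csv(otu, f, quote = FALSE)

# The same two lines as in the script: after t(), rows are samples
otus <- read.csv(f, header = TRUE, sep = ",", row.names = 1)
otus <- t(as.matrix(otus))
rownames(otus)   # "soil1" "water1" "sinkA"
```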
3. Taking the most complex case, the Multiple_sinks script, as the example:
# Load the required packages
library("vegan")
library("dplyr")
library("doParallel")
library("foreach")
library("mgcv")
library("reshape2")
library("ggplot2")
library("cowplot")
library("Rcpp")
library("RcppArmadillo")
# Clean up the environment before starting
rm(list = ls())
gc()
# All variables you need to set: 3 in total
# Set the arguments for your data; keep everything in csv format
metadata_file = "all_meta.csv" # grouping information
count_matrix = "all_otu.csv" # OTU table
EM_iterations = 1000 # default value = 1000
## if you use different sources for each sink, different_sources_flag = 1, otherwise = 0
different_sources_flag = 0
# Load the main program
print("Change directory path")
dir_path = paste("C:/ ...your path/ FEAST/") # change to the directory containing the FEAST folder
setwd(paste0(dir_path, "FEAST_src"))
source("src.R")
# Load sample metadata (also csv format)
setwd(paste0(dir_path, "Data_files"))
metadata <- read.csv(metadata_file, header=T, sep = ",", row.names = 1)
# Load OTU table
otus <- read.csv(count_matrix, header=T, sep = ",", row.names = 1)
otus <- t(as.matrix(otus))
# Computation section: no edits needed below this point
# Extract only those samples in common between the two tables
common.sample.ids <- intersect(rownames(metadata), rownames(otus))
otus <- otus[common.sample.ids,]
metadata <- metadata[common.sample.ids,]
# Double-check that the mapping file and otu table
# had overlapping samples
if(length(common.sample.ids) <= 1) {
message <- sprintf('Error: there are %d sample ids in common between the metadata file and data table',
length(common.sample.ids))
stop(message)
}
if(different_sources_flag == 0){
metadata$id[metadata$SourceSink == 'Source'] = NA
metadata$id[metadata$SourceSink == 'Sink'] = c(1:length(which(metadata$SourceSink == 'Sink')))
}
envs <- metadata$Env
Ids <- na.omit(unique(metadata$id))
Proportions_est <- list()
for(it in 1:length(Ids)){
# Extract the source environments and source/sink indices
if(different_sources_flag == 1){
train.ix <- which(metadata$SourceSink=='Source' & metadata$id == Ids[it])
test.ix <- which(metadata$SourceSink=='Sink' & metadata$id == Ids[it])
}
else{
train.ix <- which(metadata$SourceSink=='Source')
test.ix <- which(metadata$SourceSink=='Sink' & metadata$id == Ids[it])
}
num_sources <- length(train.ix)
COVERAGE = min(rowSums(otus[c(train.ix, test.ix),])) #Can be adjusted by the user
str(COVERAGE)
# Define sources and sinks
sources <- as.matrix(rarefy(otus[train.ix,], COVERAGE))
sinks <- as.matrix(rarefy(t(as.matrix(otus[test.ix,])), COVERAGE))
print(paste("Number of OTUs in the sink sample = ",length(which(sinks > 0))))
print(paste("Seq depth in the sources and sink samples = ",COVERAGE))
print(paste("The sink is:", envs[test.ix]))
# Estimate source proportions for each sink
FEAST_output<-FEAST(source=sources, sinks = t(sinks), env = envs[train.ix], em_itr = EM_iterations, COVERAGE = COVERAGE)
Proportions_est[[it]] <- FEAST_output$data_prop[,1]
names(Proportions_est[[it]]) <- c(as.character(envs[train.ix]), "unknown")
if(length(Proportions_est[[it]]) < num_sources +1){
tmp = Proportions_est[[it]]
Proportions_est[[it]][num_sources] = NA
Proportions_est[[it]][num_sources+1] = tmp[num_sources]
}
print("Source mixing proportions")
print(Proportions_est[[it]])
}
print(Proportions_est) # the original version stops here; this raw list is hard to read
# Visualization, adapted from the 文涛 script with several bugs fixed
# Write out the results
FEAST_output = as.data.frame(Proportions_est)
colnames(FEAST_output) = paste("repeat_",Ids,sep = "") # use the Ids as replicate labels
head(FEAST_output)
filename = paste(dir_path,"Result/FEAST.csv",sep = "")
write.csv(FEAST_output,filename,quote = F)
# Quick plots (one pie chart per repeat)
library(RColorBrewer)
library(dplyr)
library(graphics)
head(FEAST_output)
plotname = paste(dir_path,"Result/FEAST_repeat.pdf",sep = "")
pdf(file = plotname,width = 12,height = 12)
par(mfrow=c(ncol(FEAST_output)%/%2 + 1, 2), mar=c(1,1,1,1)) # grid sized by the number of repeats
for (i in 1:length(colnames(FEAST_output))) {
labs <- paste0(row.names(FEAST_output)," (", round(FEAST_output[,i]/sum(FEAST_output[,i])*100,2), "%)")
pie(FEAST_output[,i],labels=labs, init.angle=90,col = brewer.pal(nrow(FEAST_output), "Paired"),# brewer.pal supports 3-12 colors
border="black",main =colnames(FEAST_output)[i] )
}
dev.off()
# Quick plot (one pie chart, averaged over all repeats)
head(FEAST_output)
asx = as.data.frame(rowMeans(FEAST_output))
asx = as.matrix(asx)
asx_norm = t(t(asx)/colSums(asx)) #* 100 # normalize each column to sum to 1 (or 100)
head(asx_norm)
plotname = paste(dir_path,"Result/FEAST_mean.pdf",sep = "")
pdf(file = plotname,width = 6,height = 6)
labs <- paste0(row.names(asx_norm)," (", round(asx_norm[,1]/sum(asx_norm[,1])*100,2), "%)")
pie(asx_norm[,1],labels=labs, init.angle=90,col = brewer.pal(nrow(FEAST_output), "Paired"),# brewer.pal supports 3-12 colors
border="black",main = "mean source proportions")
dev.off()
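Beyond the pie charts, a stacked bar chart can show all repeats in one panel. This is not part of either original script, just a sketch using the already-loaded reshape2/ggplot2 (the toy FEAST_output here mimics the real one written to Result/FEAST.csv):

```r
library(reshape2)
library(ggplot2)

# Toy stand-in with the same shape as FEAST_output (sources x repeats)
FEAST_output <- data.frame(repeat_1 = c(0.5, 0.3, 0.2),
                           repeat_2 = c(0.6, 0.1, 0.3),
                           row.names = c("soil", "water", "unknown"))

# Melt the matrix to long format: one row per (source, repeat) pair
long <- melt(as.matrix(FEAST_output),
             varnames = c("Source", "Repeat"), value.name = "Proportion")

# Stacked bars: one bar per repeat, segments are source proportions
ggplot(long, aes(Repeat, Proportion, fill = Source)) +
  geom_col() +
  theme_bw()
```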
V. Notes from testing
1. Speed: FEAST is reported to be about 300 times faster than SourceTracker. A 20-sample × 2000-OTU table took only about 10 minutes on an i5 with 4 GB RAM, so it is indeed fast.
2. Accuracy: using two similar data sets (20 replicates each; A = source, B = sink; see data_files/compare_otu and compare_meta), the estimated source proportion should in theory approach 100%, but in practice it came out around 70–80%. Whether that counts as accurate is a matter of judgment.
Figure: accuracy comparison